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Inverse optimization consists of determining unknown parameters of an optimization problem based on 
knowledge of its optimal points. This paper considers inverse optimization in the setting where measure¬ 
ments of the optimal points of a convex optimization problem are corrupted by noise. We first provide a 
formulation for the inverse optimization problem with noisy data and show it is NP-hard, and then we prove 
that existing convex optimization-based heuristics for inverse optimization with noisy data are statistically 
inconsistent (meaning the answers provided by these methods generally converge to the “wrong” answer as 
more data is collected). Next, we show that our formulation is statistically consistent by combining a new 
duality-based reformulation for bilevel programs with a regularization scheme that smooths discontinuities 
in our formulation. Using epi-convergence theory, we show the regularization parameter can be adjusted to 
approximate the original inverse optimization problem to arbitrary accuracy, and this is used to prove our 
statistical consistency results. This duality-based reformulation is next used to propose two numerical algo¬ 
rithms for solving the inverse optimization problem with noisy data. The hrst is an enumeration algorithm 
that is applicable to settings where the dimensionality of the parameter space is modest, and the second is a 
semiparametric approach that combines nonparametric statistics with a modified version of our formulation 
of the inverse optimization problem. These numerical algorithms are shown to maintain the statistical con¬ 
sistency of our formulation. Lastly, we demonstrate using synthetic and real data sets the competitiveness 
of our numerical algorithms as compared to existing heuristics. 
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1. Introduction 

An appreciable share of real-world data represents decisions, which can often be characterized as 
the solutions of correspondingly-defined optimization problems. Estimating the parameters of these 
latent optimization problems has the potential to provide greater insight into how decisions are 
made, and also enable the prediction of future decisions. Examples of domains where this is impor¬ 
tant include health systems engineering ( Aswani et al.|[2C)16 ), energy systems engineering (?), and 
marketing (?), where such estimation may lead to new approaches that enable the individualization 
of products and incentives. 

For example, consider a single homeowner who each day observes an electricity price and weather 
forecast and then adjusts the temperature set-point for their home’s air-conditioner. By model¬ 
ing this homeowner’s decision as being generated from an optimization problem, we can directly 
estimate the price elasticity of comfort - as measured by a standardized function of the temper¬ 
ature set-point and the outside temperature (ASHRAE 2013) - for this particular homeowner. 
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This information is valuable for designing personalized incentive bonus schemes that encourage 


participation in demand-response programs (?) or promote energy-efficiency (Aswani and Tomlin 

2M2|. 


1.1. Overview 

This paper considers the problem of estimating unknown model parameters of an optimization 
problem based on noisy measurements of optimal solutions of this optimization problem. We 
broadly call this estimation process: inverse optimization with noisy data. In particular, the novelty 
of our approach is to provide the first statistical inference perspective on the inverse optimization 
problem. This is important because real-world decision data is noisy, either because (i) the data 
collection process introduces measurement noise, (ii) the decision-maker deviates from optimal 
decisions - phenomena often referred to as bounded rationality ( Tversky and Kahneman||1981 ), or 
(iii) there is mismatch between the equations of the model and the actual decision-making process. 

Noisy data make inverse optimization challenging because noise in the solution data can preclude 
the existence of a single set of model parameters that renders all observed solutions exactly optimal. 
In this setting, the goal of inverse optimization is to find a set of model parameters that achieves a 
good “fit” with respect to the solution data. More specifically, we are interested in two statistical 
questions. First, how can we generate estimates of unknown model parameters that asymptotically 
provide the best possible predictions from the chosen equation for the model? In statistics, this 
property is known as risk consistency (Bartlett and Mendelson]|2002 Greenshtein and Ritov|[2004 


Chatterjee|2014 ). Second, when the chosen equation for the model matches that which is generating 


the solution data, how can we generate estimates that asymptotically converge to the true value of 
the unknown parameters? In statistics, this property is known as consistency (Wald|1949, Jennrich 


1969| Bickel and Doksum 2006). We will use the term estimation consistency to distinguish this 


concept from risk consistency. Note that estimation consistency generally implies risk consistency. 

Restated, a risk consistent estimate asymptotically achieves the lowest possible prediction error 
(out of all possible predictions permitted by the class of models considered). Hence, risk consis¬ 
tency and estimation consistency allow us to be confident that prediction and estimation accuracy, 
respectively, will generally improve with additional data. By contrast, an estimator that fails to 
be risk consistent (so-called inconsistent estimators) may yield poor predictions, even if a large 
amount of data is available. Proving consistency of an estimator is an important topic in the theory 


of statistical inference (cf. (Wald 1949, Jennrich 1969, Bartlett and Mendelson 2002, Greenshtein 


and Ritov 2004, Bickel and Doksum 2006, Chatterjee 2014, Aswani 2015)), and consistency is 


considered to be a minimal requirement for an estimator (Bickel and Doksum 2006). 

The main paper begins with Section which describes the statistical and computational chal¬ 
lenges of inverse optimization with noisy data. The section begins by formally defining a (convex) 


































































3 


forward optimization problem and its corresponding inverse optimization problem. We specifically 
formulate the inverse optimization problem such that (as we later show) its solution has the desired 
statistical consistency properties. Our approach is conceptually similar to least squares regres¬ 
sion in the sense that we also employ a sum-of-squares loss function to fit a parametric model to 
noisy data. The substantive difference is that inverse optimization involves estimating the (possibly 
multi-valued) solution set of a general convex optimization problem, whereas regression typically 
involves estimating a (single-valued) function which has a closed form expression. Due to these 


differences, much of the classical statistical theory on least-squares regression (Jennrich 1969) is 
invalid in the inverse optimization setting, and thus new analysis is required. We also note that 
our approach is not restricted to the use of an l-i norm: Results similar to those in our paper can 
be proved for other loss functions, such as absolute deviation or a likelihood function, but we do 
not consider those extensions in this paper. 

In Section we study the statistical consistency of our formulation of the inverse optimization 
problem. The key technical difficulty in proving these results is dealing with continuity issues. In 
particular, the risk measures are not continuous in the general case, but are rather lower semicon- 
tinuous. As alluded to above, this precludes the use of the typical statistical machinery used to 


prove consistency results (namely the uniform law of large numbers (Jennrich 1969) and related 
uniform bounds (Bartlett and Mendelson 2002, Greenshtein and Ritov |2004 )). To circumvent this 
difficulty, we dehne a regularized version of the inverse optimization problem that smooths out 
any discontinuities, and this regularized version of the problem is constructed using a new duality- 
based reformulation for bilevel programs. Using epi-convergence theory, we show the regularization 
parameter can be adjusted to approximate the original inverse optimization problem to arbitrary 
accuracy. The section concludes by using the regularized version of the inverse optimization prob¬ 
lem to prove results on the statistical consistency of our formulation. 

Section provides two numerical algorithms for solving our formulation of the inverse opti¬ 
mization problem. The first numerical algorithm is an enumeration algorithm that is applicable to 
settings where the dimensionality of the parameter space is modest (i.e., at most four or five param¬ 
eters). The second numerical algorithm is a semiparametric approach that combines nonparametric 
statistics with a modified version of our formulation of the inverse optimization problem. The sta¬ 
tistical consistency of these two numerical algorithms are shown using the results from Section 
Lastly, in Section we demonstrate using synthetic and real data sets the competitiveness of our 
approaches as compared to existing heuristics (Keshavarz et al. ]|20II[ [Bertsimas et al.||20I4[ ). 

1.2. Literature Review 

Existing inverse optimization models differ based on their specification of the loss function, and 
the different models can be broadly categorized into either (i) deterministic settings, or (ii) noisy 
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settings. The work in the deterministic setting has primarily focused on single observation situa¬ 
tions, wherein a single optimal solution is observed and then nsed to estimate parameters of the 
optimization problem. However, in the noisy setting past work has considered situations with either 
a single observation and multiple observations. 


We begin by describing some of the work in the deterministic setting: Ahuja and Orlin (2001) 
consider the estimation of objective function coefficients of general linear programs given a single 
optimal solution. The feasible region of the inverse problem is formulated using the constraints of 
the dual program and complemetary slackness conditions. Since the observed solution is assumed 


to be optimal, feasibility of the inverse problem is guaranteed. Iyengar and Kang (2005) and Zhang 


and Xu (|2010) extend inverse optimization to certain conic forward problems using conic duality 


theory. Inverse optimization models have also been studied in the context of integer programs 
( Schaefer[|2009 , Wang|20M ) and network problems ( Burton and Toint[|1992 , Hochbaum|2003 , Zhang 


and Liu 1996). With respect to applications, inverse optimization models has been employed in 


many different domains, including healthcare ( Erkin et al.||2010 Chan et'n^|2014 ), energy (Ratliff 
eTalllMI , |Saez-Gallego et aL]|2016D , finance ([Bertsimas et al.||2012[) , production planning ([Troutt 


et al.||2006), demand management (Carr and Lovejo^|2000, Bajari et al.|2007), auction design (Beil 


and Wein||2003 ), telecommunication (Farago et al. 2003) and geoscience (Bnrton and Toin^|1992). 


We refer the reader to Heuberger (2004) for a snrvey of inverse optimization methods. 


The noisy setting has been less studied. Chan et al. (2014) propose a generalized approach to 
inverse optimization for linear programs where the (single) observed solution may be suboptimal or 
infeasible. Instead of complementary slackness, the authors use dual feasibility and strong duality 
to formulate the inverse problem. To accommodate noise, the strong duality constraint is relaxed 


to guarantee feasibility of the inverse problem. Saez-Gallego et al. (2016) also consider inverse opti¬ 


mization for linear programs, and formulate the inverse problem using KKT conditions. [Keshavarz 


et al. (]2011 ) formulates the inverse problem using the KKT conditions of the optimization problem. 


To accommodate noise, the KKT conditions are relaxed by introducing slack variables to allow the 


data to “approximately” satisfy the KKT conditions. Similarly, Bertsimas et al. (2014) consider 
inverse problems where the observed data are assumed to be in an equilibrinm. The authors enforce 
optimality conditions using a variational inequality, and similarly relax the optimality conditions by 
introducing slack variables to allow the data to “approximately” satisfy the variational inequality. 

Our work in this paper is most closely related to the noisy setting with multiple observations 


that has been previously considered by Keshavarz et al. (2011) and Bertsimas et al. (2014). The key 
distinction between onr work and these two previons approaches is in the choice of the loss fnnction. 
In (Keshavarz et al. 2011) and ( Bertsimas et al.|2014 ), the loss function is measured by the amount 
of slack required to make the measured data satisfy an approximate optimality condition (either the 
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KKT conditions (Keshavarz et al. 2011) or a variational inequality describing optimality (Bertsimas 


et al.|[2M4|). In contrast, our approach is to jointly estimate (i) the parameters of the optimization 


problem, and (ii) the denoised versions of the measured data (i.e. the true underlying optimal 
solutions). By performing this joint estimation, we are able to define our loss function to be the 
average discrepancy between the measured data and the (estimated) denoised data. As we will show, 
this difference in loss function leads to signihcantly improved statistical performance. A secondary 
distinction is that we propose the use of a novel optimality condition: specifically, we upper bound 
the objective function of a convex optimization problem by its dual - thereby enforcing a zero 
duality gap and guaranteeing optimality. An important benefit of using this alternate optimality 
condition is that it has favorable convexity and continuity properties (which are not available when 
using KKT conditions or variational inequalities to represent optimality) that enable design of 
numerical algorithms for solving the inverse optimization problem. 

1.3. Contributions 

Our contributions in this paper include both statistical and optimization results, and there are 
specifically two main contributions. The first is we show that solving a bilevel formulation for the 
problem of inverse optimization with noisy data provides parameter estimates that are statistically 
consistent. This statistical result is independent of the approach used to solve the bilevel formula¬ 
tion. Our second main contribution is to propose two numerical algorithms for solving the bilevel 
formulation by using a novel duality-based reformulation. However, other numerical algorithms can 
be used to solve the bilevel formulation. For instance, the bilevel program can be reformulated as 


a mixed-integer quadratic program (MIQP) in some cases (Jose Fortuny-Amat 1981, Audet et al. 


1997|. Our statistical results apply to any numerical algorithm for solving the bilevel formulation. 


including the MIQP reformulation (when possible) or our two algorithms. 

We also prove that existing heuristics for inverse optimization with noisy data (]Keshavarz et aL 


2011, Bertsimas et al.|2014), which are expressed as convex optimization problems, are statistically 


inconsistent - meaning that in the limit of increasing amount of data these approaches will generate 
parameter estimates that converge to incorrect values. This is perhaps not unexpected, because 
we also prove that the problem of inverse optimization with noisy data is NP-hard. It should be 
noted that the inverse optimization problem without noisy data can be solved in polynomial time. 


as shown by Keshavarz et al. (2011) and Bertsimas et al. (2014). 

An additional contribution is we propose a novel reformulation of bilevel programs where there 
lower level optimization problem is convex. It is common to replace the lower level problem by 


the KKT conditions or to upper bound the objective function by the value function (Dempe et al. 


2015). However, these approaches face certain numerical difhculties. We propose to upper bound 
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the objective function by its dual, which enforces a zero duality gap and describes an optimal point. 
The benefit of our optimality condition is it has convexity and continuity properties that support 
the design of numerical algorithms. The two numerical algorithms we propose directly make use of 
this optimality condition, and the proofs of our statistical results are also aided by the use of this 
optimality condition. 


1.4. Notation 

Most notation we use in this paper is standard, and we briefly summarize some of the less usual 
aspects of our notation. We use || • || to denote the usual £ 2 -norm. The indicator function l(u) is 
defined to be 


t{u) = 


( 1 ) 


1, if condition u is satisfied 
0, otherwise 

V 

When u is a vector and A = {oi, 02 ,...} is a set, the notation (u)^ refers to the vector formed by 
the components of u with indices given in A. The notation [r] = {1,... ,r} refers to sequential set. 
The Kuratowski limit superior of a sequence of sets C is defined as 


limsup^,(C^) = {x G ; liminf dist(x,C,,) = 0}, 


( 2 ) 


where dist(x,C) = inf{||x — c|| | cGC}. We similarly define dist(.B,C) = inf{dist(x,C) | xeB}. 

2 . Challenges with Noisy Inverse Optimization 

This section begins by formalizing the notation for the forward problem, before defining the noisy 
inverse optimization problem. For the case where we have access to measurements (rather than 
the underlying distributions), we formulate a related sample average approximation of the inverse 
optimization problem. We show that both these inverse problems are NP-hard. We conclude by 
showing that existing heuristic approaches for solving the inverse optimization problem are statis¬ 
tically inconsistent, meaning that in the limit of infinite data these heuristic approaches converge 
to incorrect solutions. 

2.1. Model for Forward Problem 

Let X G be the decision variable, u G be the external input variable, and 0 G be the 
parameter vector. Then the forward optimization problem is given by 

FOP min{/(x,u,0) | g(x,u,0) <0j, 

where / : x M™ x —)■ M is a function and g :R^ x x —)■ is a vector-valued function. The 

solution set of FOP is the set-valued function given by S(u, 9) = argmin 2 ,{/(x, u, 9) \ g{x, u, 9) < 0}. 
The value function of FOP is given by V{u,9) = m.in^{f{x,u,9) \ 9{x,u,9) < 0}, and the feasible 
set is defined as <l>(u, 0) = {x G : g{x, u, 0) < O}. 
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2.2. Model for Inverse Optimization Problem 

Suppose (n, y) G M’” x is a vector-valued random variable that is distributed according to some 
unknown but fixed joint distribution P(u,y). Let U x y C be the support of this distribution, 
meaning the smallest set that satishes the property = 1. If we define the function 

RISK Q(9) =e( min lly —xlpV 

\ x^S{u,6) / 

then the inverse optimization problem is given by 

lOP min{Q(0) I 6»G0}, 

where 0 C Rp is a known set. We make the following assumptions: 

Al. The functions f{x,u,9),g{x,u,9) are continuous and convex in x, for fixed u,9. 

A2. The set 0 is convex. 


These assumptions are fairly mild. Al is equivalent to stating FOP is a convex optimization 
problem, and A2 is asserting that the set of possible 9 is convex. 

When the joint distribution is unknown, we cannot solve lOP without additional infor¬ 

mation. Fortunately, we can leverage the iid measurements {ui,yi) for i£ [n]. In the context of a 
decision-making agent, we should interpret the (i) Ui as an external signal the agent responds to, 
and (ii) yi as a measurement of the corresponding decision of the agent. In principle, we can solve 
lOP using a sample average approximation: 

lOP-SAA min{Q„(0) | 0G0}, 


where 


RISK SAA 


Qni9) = min 

Xi 



i=l 


s.t. Xi G S{ui,9), 


yi G [n] 


2.3. NP-Hardness of Inverse Optimization Problem 

Though all the functions and sets involved in FOP and lOP are convex, solving lOP is NP-hard. 
Theorem 1. If A1,A2 hold, then lOP is NP-hard. 


Proof. We prove this by showing a reduction from the problem of computing the best rank-1 
approximation of an order 3 tensor (which is NP-hard (Hillar and Lim|2013)) to I OP. Consider any 




-0 G M’’!^’’2^’’3, where ri,r 2 ,ra G M+. This defines t/’ to be an order 3 tensor. We define p = ri+r 2 + r^, 
and suppose the parameter vector is given hy 9 = {a,b,c) G 0 = M'’, where a G b G and 
cGM’' 3. Also define the discrete set U = [ri\x [r 2 ] x [ra], and suppose that u = {a,f3,y) is uniformly 
distributed over U. Furthermore, suppose y is a random variable given by which means that 

y is dependent on u since u = (a,/3,7). Then we define the following forward optimization problem 

5(u,6') = argmm(^x-(a)„-(6)/3-(c)T,^ . (3) 


This forward optimization problem is a quadratic program (QP) when {u,6) is fixed, and so the 
solution set is S{u,9) = {a)a{b )Note that the solution set consists of a single point. Next, 
observe that 


^ r2 rg 

mm Q{9) = mm - - (a)„ • (6)^ • (c)^ j , 

^ _ _ 1 a — 1 — 1 


(4) 

a—1 /3—1 7—1 

where we have converted the expectation into a weighted sum using the fact that u is uniformly 
distributed over U. Observe that Q is the problem of computing the best rank-1 approximation 
to an order 3 tensor (Hillar and Lim 2013). □ 

Remark 1. Inapproximability results for lOP can be shown under the setting where 0 is allowed 
to be a discrete set (i.e, A1 holds, but A2 does not hold). In particular, there is a straightforward 
reduction from the shortest vector problem. This implies that lOP is NP-hard to approximate to 
within any factor up to for any e > 0 (Haviv and Regev 2012). 

Remark 2. Polynomial-time solvability of lOP is possible in very specific settings. For instance, 
if the solution set of FOP is S{u,9) = argmin^jj^^ — 2{6 + u) ■ x} = 9 + u, then lOP is a QP: 
mingge {lE((y — 9 — u)^)}, and its minimizer is 9* = E(y — u). 

This problem lOP-SAA is a bilevel program, and bilevel programs are usually difficult to solve 


(Dempe et al. 2015). In fact, lOP-SAA is NP-hard to solve. 


Corollary 1. //A1,A2 hold, then lOP-SAA is NP-hard. 


Proof. We show this result using the same construction used to prove Theorem[^ In particular, 
observe that if {ui ,..., «„} = U, then lOP-SAA is equivalent to lOP , which is NP-hard by Theorem 
Finally, note that the condition {ui,... ,n„} = U occurs with nonzero probability since the set 
U is finite and since the Ui are sampled uniformly from U. □ 

Remark 3. Inapproximability results for lOP-SAA can be shown under the setting where 0 
is allowed to be a discrete set (i.e, Al holds, but A2 does not hold). In particular, the same 
construction in Remark can be used to shown lOP-SAA is NP-hard to approximate to within 
any factor up to 24°®'^)^ for any e > 0 (Haviv and Regev 2012). 

Remark 4. Polynomial-time solvability of lOP-SAA is possible in very specific settings. For 
instance, the construction in Remark leads to an instance of lOP-SAA that is a QP. 


(Haviv and Regev 2012 
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2.4. Statistical Inconsistency of Heuristic Approaches 

We begin with two statistical definitions of consistency: risk consistency and estimation consistency. 
These definitions are stated in order of increasing stringency, meaning that risk consistency is 
necessary (in situations with sufficient continuity) for estimation consistency. The first definition 
relates to the best predictions possible using the given forward optimization problem. 

Definition 1 (Risk Consistency). An estimate (9„ G 0 is risk consistent if 

Q(0„)^min{Q(0) I 0E0}. (5) 

We should interpret the function Q{9) as the expected prediction error when the parameter values 
are 9, where the prediction is the solution set S{u,9). And so the above definition is stating that 
an estimator is risk consistent if the expected prediction error of the estimate 9n converges in 
probability to the minimum prediction error possible when we use the forward optimization model 
described by FOP and constrain 9 to belong to 0. In other words, an estimator is risk consistent 
if it asymptotically provides the best predictions possible. 

The second statistical definition relates to the situation where the forward optimization model 
described by FOP is correct and there is a true parameter. In particular, it applies to situations 
where the below identifiability condition is satisfied. Briefly summarized, the identifiability 
condition is satisfied when FOP is such that two different parameter values 9i and 02 lead to 
two different distributions for measurements of the decision data Ui. More details and clarifying 
examples are found in Appendix 

IC. There exists a unique 9o e 0 such that the following three sub-conditions hold: (i) y = ^ + w, 
where ^ G S{u,9q), K{w) = 0, < -Foo, and u,^ are independent of w, (ii) for all 0 G 0 \ 0o 

there exists U{9) such that P(nG^(0)) >0 and dist{S{u,9),S{u,9Q)) >0 for each u£ly({9), 
and (hi) for each fixed 0 G 0 we have P({u : S{u, 9) is multivalued}) = 0. 

The first sub-condition of the identifiability condition is stating that the solution data Ui is a noisy 
measurement (with noise random variable w) of a point that belongs to the solution set S{ui,9o), 
and the second sub-condition is stating that when 0 is different from 0o then this leads to different 
solution sets. This second sub-condition is necessary, because otherwise we could not distinguish 
the predictions of FOP when the parameters 0 differ from 9o- The third sub-condition eliminates 
pathological cases that occur when the solution set at a fixed 0 is so large that it approximately 
encompasses all possible solutions. Note that this third sub-condition is mild, and examples where 
it is satisfied include when (i) FOP is strictly convex, or when (ii) FOP is a linear program with 
random coefhcients drawn from a continuous distribution; it holds for other examples as well. The 
second statistical definition is related to this identifiability condition. 
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Definition 2 (Estimation Consistency). Suppose IC holds. An estimate G 0 is estima¬ 
tion consistent if 

L ^ Oo. ( 6 ) 

Stated in words, an estimate is estimation consistent if it converges in probability to the true 


parameter values 6q. This is the classical notion of consistency of a statistical estimator (Bickel 


and Doksum 2006). 


Though these statistical notions of consistency are quite natural, it is the case that existing 
heuristic approaches for solving the inverse optimization problem are statistically inconsistent. We 


will use VIA to refer to the variational inequality method of Bertsimas et al. (2014), and we refer 


to the KKT conditions approach of Keshavarz et al. (2011) as KKA. 


Proposition 1. Suppose A1,A2 and IC hold. Then VIA (Bertsimas et al. 2014) and KKA 
\Keshavarz et al. 2011) are not estimation consistent. 

Proof. We show this using a counterexample. Suppose FOP is min{x^ — (0-!-n) -x | x G [0,10]}, 
and note its solution set S{u, 9) = rninj^^^, 10} is single-valued. Assume the distribution of u is 


u = 


0, with probability (w.p.) | 
20, w.p. I 


and that the distribution of w is 


-T 


w = 


w.p. 


(7) 


( 8 ) 


^+1, w.p. I 

Finally, suppose y = S{u,6) + w, 0 = {0gM:O<0< 10}, and 9q = 10. By construction, this 
problem satisfies A1,A2,IC. Also, observe that the joint distribution of (u,y) is 


o' 

w.p. 1 

( 0, 6), 

w.p. 1 

(20, 9), 

w.p. 1 

(20,11), 

w.p. 1 


{u,y) = < 


We show that both VIA and KKA are not estimation consistent for this problem. 
We begin with VIA. This approach solves 


(9) 


The constraint 


• 1 o 

mm - 

eee ” 

s.t. Vf{yi,Ui,9) ■ {xi-yO > -ei,Vxi G [0,10], Vi G [n] 
Vf{yi,Ui,9) ■ {xi-yO > -e,,Vx, G [0,10] 


( 10 ) 


( 11 ) 


is a variational inequality, and VIA exactly reformulates this using linear duality. We operate with 
the original variational inequality since the reformulation in VIA is exact and does not change the 
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solution. If Ui = 4, then a straightforward calculation gives that 0 is equivalent to the constraint: 
Ci > 4 • (8 — 0) if 0 < 8, and > —6 • (8 — 0) if 0 > 8. If Ui = 6, then ( [IT| ) is equivalent to the constraint 
> 6 • (12 — 9). If Hi = 9, then pTj ) is equivalent to the constraint > 2 + A:. Finally, if yi = 11, 
then 0 is equivalent to the constraint: > 11 • (2 — 0) if 0 < 2, and > 2 — 0 if 0 > 2. Next, we 

solve the problem min{e^ | 0} for each possible value of yi and 9. If = 4, then the minimum is 

16 • (8 — 9Y if 0 < 8, and 36 • (8 — 9Y if 0 > 8. If yi = 6, then the minimum is 36 • (12 — 0)^. If yi = 9, 
then minimum is (2 + 0)^. If = 11, then the minimum is 121 • (2 — 9Y if 0 < 2, and 0 if 0 > 2. 
Thus, we have 

r 36 • (12 - 9f + (2 + 9f + 121 • (2 - 9f + 16 • (8 - 9f, if 0 < 2 

4-E(e^)= i 36-(12-0)2 + (2 + 0)2 + 16-(8-0)2, if0e(2,8] (12) 

[36 • (12 - 9f + (2 + 9f + 36 • (8 - 0)2, if 0 > 8 

‘Finally, we solve the optimization problem min{E(e2) | 0 G [0,10]}. A simple calculation gives 
that the minimum occurs at 0* = ^ 

probability to 0*, because (i) we can exactly reformulate (10) as 


9.8356. However, the minimizer of (10) will converge in 


■ 1 9 

Him - > e~ 

ee0 ” ^*-1 


s.t. e- = 


yi G [n] 


(13) 


' 16 • (8 - 0)2 • 1(0 < 8) + 36 • (8 - 0)2 • 1(0 > 8), if y, = 4 
36-(12-0)2, ify. = 6 

(2 + 0)2, ify, = 9 

[ 121 • (2-0)2.1(0 < 2), ify* = ll 

which (ii) implies we can apply the uniform law of large numbers pennrich 1969) since ef as dehned 


in (13) is a continuous function, and thus (iii) we get convergence of the minimizer from a standard 


consistency result in statistics (see for instance Theorem 5.7 in (van der Vaart 2000) or Theorem 


5.2.3 in (Bickel and Doksum 2006)). This shows VIA is not estimation consistent, since 0o = 10. 
Next, we consider KKA. This approach solves 

kilP 


• 1 

mm - > 


=1 


s.t. Vf{yi,Ui,9) - (Ai)i + (Ai )2 = (e*)i 
~ {^i)l ■ Vi = {^1)2 
{K)2 • {Ui — 10 ) = {^ 1)3 

A,,>0 


(14) 


We first solve the problem (14), with re = 1, for each possible value of yi and 0. If y^ = 4, then the 


minimum is • (8 — 0)2 if 0 < 8, and p • (8 — 0)2 if 0 > 8. If y^ = 6, then the minimum is p • (12 — 0)2. 
If y* = 9, then the minimum is 1 • (2 + 0)2. If y^ = 11, then the minimum is • (2 — 0)2 if 0 < 2, 
and 1 • (2 — 0)2 if 0 > 2. Thus, we have 

' p • (12 - 0)2 + 1 . (2 + 0)2 + 1|1. (2 - 0)2 + p . (8 - 0)2, if 0 < 2 


4-K(||e.||2) = 


p • (12 - 0)2 + 1. (2 + 0)2 + 1. (2 - 0)2 + p . (8 - 0)2, 


17 

36 


P • (12 - 0)2 + 1. (2 + 0)2 + 1. (2 - 0)2 + P . (8 - 0)2, 


if 0G (2,8] 
if 0>8 


(15) 
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Finally, we solve the optimization problem min{E(||ei|p) | 0G [0,10]}. A simple calculation gives 


that the minimum occurs at 0* = ~ 6.5903. However, the minimizer of (14) will converge in 


probability to 9*, because (i) we can exactly reformulate (14) as 


• 1 II I 

mm - > . , eJ 
see " ^*=1 " ' 


s.t. 


16 

17 

36 


• (8 - 0)2 • 1(0 < 8) + P • (8 - 0)2 • 1(0 > 8), if y, = 4 

if y* = 6 
if y* = 9 
if = 11 


,,^.§•( 12 - 0 ) 2 , 

^(2 + 0 )^ 

121 


Vi G [re] 


(2-0)2.i(0<2) + i.(2-0)2 


which (ii) implies we can apply the uniform law of large numbers (Jennrich 1969) since 


(16) 


as 


defined in (16) is a continuous function, and thus (iii) we get convergence of the minimizer from a 


standard consistency result in statistics (see for instance Theorem 5.7 in (van der Vaart 2000) or 


Theorem 5.2.3 in (Bickel and Doksum 2006)). This shows that KKA is not estimation consistent, 
since 9q = 10. □ 


Corollary 2. Suppose A1,A2 hold. Then VIA l[Bertsimas et at. 2014) and KKA (Keshavarz 


et al. 2011) are not risk consistent. 


Proof. Risk consistency is necessary for estimation consistency when Q{6) is continuous. For 
the counterexample in the proof of Proposition [^ we have 

Q(0) = e(|| 2/ - min{^, lOjf) = | • ((4 - f )2 + (6 - f )2 + (9 - 10)2 + (11 - 10)2) (17) 


since 0 = {0 gM:O< 0< 10}. This Q{9) is continuous, and so the corollary follows from Proposition 
(As an aside, note that argmin{Q(0) | 0 G 0} = 10, which is the correct parameter value.) □ 
The intuition for why VIA and KKA are statistically inconsistent is that they are minimizing 
an incorrect measure of error: These approaches generate an estimated set of parameters that 
minimizes the level of suboptimality of the measured solution data. However, this leads to biased 
estimates because suboptimality is measured by (i) deviations in the value of the objective function 
of FOP and (ii) the amount of constraint violation of FOP, whereas noise directly perturbs the 
solution data. This is in contrast to our approach (as exemplified by lOP-SAA) which generate an 
estimated set of parameters that minimizes the deviation between predicted and measured solution 
data. This distinction between suboptimality and deviations in the solution data becomes most 
apparent (and critical) in problems with constraints. 


3. Consistent Estimation for Inverse Optimization Problem 

Given the statistical inconsistency of existing heuristics, we propose to solve the noisy inverse 
optimization problem by instead solving SAA-IOP. First, we will need to impose a regularity 
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condition to ensure that FOP and lOP-SAA are numerically well-posed: 

Rl. For each u^U and 0 G 0, the feasible set ^{u,6) is closed, bounded, and has a nonempty 
relative interior (i.e., relint($(M, 0)) / 0). The feasible set ^{u,9) is also absolutely bounded, 
meaning there exists M > 0 such that ||x|| < M, for all x G $(u, 0), n G and 6 eQ. 


Condition Rl is equivalent to requiring FOP has a strictly feasible point (i.e.. Slater’s condition 
holds), and that the feasible set of FOP is closed and bounded. The first sub-condition requiring the 
feasible set be closed and bounded is needed to ensure the existence of well-posed primal and dual 
solutions, and it could be replaced by more general conditions. For instance, we could have instead 


assumed FOP satisfies the uniform level-boundedness condition (Rockafellar and Wets 1998). We 


use the above for simplicity of stating the results. The second sub-condition is needed to ensure 
we can apply the Berge Maximum Theorem ( ]Berge 1963). 

The simplest case of statistical consistency of SAA-lOP occurs when the function f{x,u,9) is 
strictly convex, because of the following result: 

Proposition 2. Suppose A1,A2 and HI hold. If f{x,u,9) is strictly convex in x for fixed u ^ U 
and 0 G 0, then Q„(0) is continuous. 

Proof. Because the feasible set $(u,0) is convex for fixed u,9 by A1 and has a nonempty 


interior by Rl, this means $(u, 0) is continuous in 9 by Example 5.10 from (Rockafellar and Wets 


1998). Thus, we can apply the Berge Maximum Theorem ( Berge|1963 ) to FOP. This implies S{u,9) 
is upper hemicontinuous in 9 for fixed u€U. However, S{u,9) consists of a single point for fixed 
uGU and 0 G 0, because the objective function is strictly convex and since Rl holds. Consequently, 
S{u,9) is a continuous single-valued function for fixed u G 0 (see for instance Theorem 2.6 in 
( Rockafellar and Wets| 1998)). Thus, we can apply the Berge Maximum Theorem to RISK-SAA, 
and this implies that Qn{9) as defined in RISK-SAA is continuous. □ 


In this case, we can prove risk and estimation consistency using standard arguments (Jennrich 


1969 

van der Vaart 

20001 

Bickel and Doksum j2006 


large numbers (Jennrich 1969). However, this approach cannot be applied to the more general 


case where f{x,u,9) is not strictly convex. In particular, when f{x,u,9) is not strictly convex, the 
function Qn{9) will not generally be continuous. And so a different argument is required because 
the uniform law of large numbers does not apply to discontinuous functions. 


Our approach will be to use a statistical consistency result originally due to Wald (1949) that 


uses a one-sided bounding argument. The advantage of this approach is that it only requires 
lower semicontinuity, which we show always holds for Q„(0). However, this result only implies 
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the estimates converge in probability to the set of minimizers of Q{0). This cannot imply risk 
consistency in the general case because Qn{0) is lower semicontinuous, which means that Q{6n) 
can remain bounded from the minimum Q{9). And so for the general case, we will show that a 
weak risk consistency result holds. 

To develop the statistical consistency results for the most general case, we will develop a regu¬ 
larized version of RISK-SAA that is guaranteed to be continuous. The first step of this construction 
involves proposing a new reformulation for bilevel programs that we call a duality-based reformula¬ 
tion. Next, we use this reformulation to construct a regularized version of RISK-SAA and prove its 
continuity. We use this regularized version to prove statistical consistency results about lOP-SAA 
and a regularized version of lOP-SAA. 


3.1. Duality-Based Reformulation 

One approach to solving bilevel problems (such as lOP-SAA) is to reformulate the problem as a 
normal (i.e., single level) optimization problem by replacing the constraints Xi €S{ui,9) with an 


optimality condition (Dempe et al. 2015). One possibility is to replace x* ^S{ui,9) by the KKT 
conditions of FOP, and another possibility is to upper bound the objective function using the value 
function f{xt,Ui,9) <V{ui,9). Unfortunately, these approaches often encounter numerical difficul¬ 
ties. The KKT approach leads to a nonlinear program with combinatorial complexity, because of 
the complimentary slackness in KKT. The value function approach is difficult to implement because 
closed-form expressions for the value function are not available except for very special cases. 

Here, we present a new optimality condition. Given the numerical difficulties of existing 
approaches, we propose to solve bilevel programs (such as lOP-SAA) by using the Lagrangian dual 
function to upper bound the objective function. The following proposition shows that our idea of 
using the dual as an upper bound represents a novel optimality condition. 

Proposition 3. Suppose A1,A2 and R1 hold. Then a point is optimal x G S{u,9) if and only 
if there exists a corresponding A G for which x, A satisfy the inequalities 

f{x, u, 9) — h{X, u,9) <0 

g{x,u,9)<0 (18) 

A>0 

where h{X,u,9) is the Lagrangian dual function of POP. 

Proof. We first prove the forward direction. Consider any optimal point x € S{u,9), and note 
that it satisfies g{x,u,9) < 0 since the feasible set is nonempty by Rl. Conditions A1,R1 imply 
strong duality, meaning there exists A > 0 such that h{X, u, 9) = f{x, u, 9) (see for instance Theorem 


2.165 in (Bonnans and Shapiro 2000)). As a result, this x,X satisfies (18). 
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Next, we prove the reverse direction of the result. Let x be a point that satisfies (18), and note 
that for fixed u, 9 we have 

/(x, u, 9) < h{X, u, 9) < max> {h{X, u,9) | A > O} < min^, {/(x, u, 9) | g{x, u, 0) < O}, (19) 

where the first inequality is a restatement of ( |18[ ), and the last inequality follows by weak duality 
(e.g., (2.268) in ( Bonnans and Shapiro||2000 )). Hence, X G S{u,9) (i.e., X is an optimal point). □ 
As a result, we can exactly reformulate RISK-SAA as the following optimization problem: 


1 " 

Q„(6') = min - V||?/, 

n ^^ 


-X,; 


DB RISK SAA 


s.t. f{xi,u,,9) - h{Xi,Ui,9) <0, Vi G [n] 
g{x,,u^,9) <0, ViG[n] 

Xi >0, Vi G [n] 

One important feature of this reformulation is that it is a convex optimization problem for fixed 
values of 9. 

Proposition 4. Suppose A1,R1 hold. Then DB-RISK-SAA is a convex optimization problem 
for fixed 9. 

Proof. Recall f{x.i,Ui,9),g{xi,Ui,9) are convex in Xi for fixed Ui,9, by Al. Furthermore, the 
Lagrangian dual function h{Xi,Ui,9) is concave in Xi for fixed Ui,9 (see for instance Proposition 


11.48 in (Rockafellar and Wets 1998)). Moreover, there exists > 0 such that h{Xi,Ui,9) is finite, 
because A1,R1 holds (see for instance Theorem 2.165 in ( Bonnans and Shapiroj 2000)). Thus, 
f{xt,Ui,9) — h{Xi,Ui,9) is a convex function in Xi for fixed Ui,9. Since the objective function is 
convex, this means that DB-RISK-SAA is a convex optimization problem. □ 

3.2. Regularized Formulation 

Recall that Q„(-) is generally not continuous even when A1,A2,R1 hold. Consequently, we develop 
a regularized version of the duality-based problem that is guaranteed to be continuous. We define 
the e-regularized version of the duality-based problem to be 


-Xi 


R-DB-RISK-SAA 


Q„(6»;e) = min 

Xi,\i 71 

i—\ 

s.t. f{x,,u,,9) - h{Xi, Ui,9)< e, Vi G [n] 
g{x,,u,,9) <e, Vi G [n] 

Xi >0, Vi G [n] 

And we associate this to a regularized version of the sample average approximation of the inverse 
optimization problem: 

R-IOP-SAA mm{Qn{9;e) \ 0G0}. 
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The idea of this regularization is that we relax the optimality conditions to allow points Xi to be 
an e-solution. Recall that a point 


G e-argmin{/(x) | g{x) < 0}, 


( 20 ) 


if (i) f{x^) — f* <e and (ii) g{x^) < e, where f* = min{/(x) | g{x) < 0}. 

Proposition 5. Suppose A1,A2 and R1 hold. Then a point x is an e-solution if and only if 
there exists a corresponding A G for which x, A satisfy the inequalities 

f{x, u, 6) — h{\, u,6) <e 

g{x,u,e)<e (21) 

A>0 

where h{\,u,9) is the Lagrangian dual function o/FOP. 

Proof. We first prove the forward direction. Consider any point x that is an e-solution, and 
note it satisfies g{x,u,9) < e by definition. Conditions A1,R1 imply strong duality, meaning there 
exists A > 0 such that h{X,u, 9) = mm{f{x, u, 9) \ g{x,u,9) <0} (see for instance Theorem 2.165 in 


(Bonnans and Shapiro 2000)). Combining this with the definition of an e-solution, we have 


f{x,u,9) — h{X,u,9) = f{x,u,9) — mm{f{x,u,9) \ g{x,u,9) < 0} < e. 


( 22 ) 


As a result, this x,X satisfies (18). 


Next, we prove the reverse direction of the result. Let x be a point that satisfies (21), and note 
that for fixed u, 9 we have 


f{x, u, 9) < h{X, u,9)-\-e< maxA {/i(A, u,9) | A > O} -Fe < min^; {/(x, u, 9) \ g{x, u, 9) < O} -Fe, (23) 


where the first inequality is a restatement of (21), and the last inequality follows from weak duality 
(e.g., (2.268) in ( Bonnans and Shapiro|||2000 )). This means that x must be an e-solution. □ 

One benefit of this regularization is that it ensures convexity of R-DB-RISK-SAA when 9 is 
fixed. 


Proposition 6. Suppose Pl\,A.2 andRl hold. T/ien R-DB-RISK-SAA is a convex optimization 
problem for fixed 9. 

Proof. The proof is identical to that of Proposition □ 

Though the above propositions show that the regularization is equivalent to replacing optimality 
conditions with e-optimality conditions while maintaining convexity for fixed values of 0, the main 
benefit of the regularization is that it ensures the function Q„(0;e) defined in R-DB-RISK-SAA is 
continuous in 9, e for any e > 0. 
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(24) 


Proposition 7. Suppose A1,A2 and R1 hold. Then the function Qn{9]e) is jointly continuous 
in 9, e for any e > 0. 

Proof. The solution set S{u,9) is nonempty nnder A1,R1 (see for instance Theorem 1.9 of 
( Rockafellar and Wets||199^ ). Pick any Xi eS{u,9), and let Xi be snch that Xi,Xi satisfy (18) - 
this Xi exists by Proposition [3j Next, consider the sets 

S{u„9;e) = {x : f{x,Ui,9) - h{X„u^,9) < e, g{x,Ui,9) < e} 

S{u,,9-,e) = {x:f{x,Ui,9)-h{X,u,,9)<e, g(x,Ui,9) <€, A > 0} 

and note that S{ui,9;e) = S{ui,9-,e), since as shown in the proof of Proposition we have 
h{X,Ui,9) < h{Xi,Ui,9) for all A > 0. Observe that the innctions f{xi,Ui, 9), g{xi,Ui, 9) are continu- 
ons and convex from Al, and the point Xt belongs to the interior of S{ui, 9] e) since it satisfies (18). 
Thus, we can apply Example 5.10 from (Rockafellar and Wets 1998): This yields that S{ui,9;e) 
is continnous in 9,e for any e > 0, and so we also get continuity of S{ui,9-,e) by its equality 
to S{ui,9-,e). Since R-DB-RISK-SAA can be written as Q„(d;e) = X)r=i 11^* I ^ 

S{ui,9;e), Vi G [n]}, we are able to apply the Berge Maximnm Theorem ( Berge|[l963[ ). This implies 
continuity of Q„(d; e) in 9, e for any e > 0. □ 

A point of note is that within the above proof, we show that the set of e-optimal solutions of a 
parametric convex optimization problem S{ui,9‘,e) is continnous with respect to the parametriza- 
tion 9; this is in contrast to the solution set of a parametric convex optimization problem S{ui,9), 
which is in general only npper hemicontinnous with respect to the parametrization 9. The case of 
a parametric strictly convex optimization problem is the exception, which as shown in the proof 
of Proposition has a continuous (with respect to the parametrization 9) solution set. 

The fnnction Qn{9\e) will not be jointly continuous in 0, e at e = 0. However, it satisfies another 
property that is useful for solving lOP-SAA: 

Proposition 8. Suppose A1,A2 and R1 hold, and let > 0 6e a monotone decreasing sequence 
with Cj, —>• 0. Then we have min{Q„(0; e^) \ 0 G 0} —)• min{Q„(0) | 0 G 0} and 


limsupj,(argmin{Q„(d; Cy) | 0 G 0}) C argmin{Q„(d) | 0G0}. 


(25) 


If Zy > h is a monotone decreasing sequence with Zj, —)■ 0, then we also have 

limsnp^(2;,,-argmin{Q„(0; Cj,) | 0 G 0}) C argmin{Qra(0) | dG0}. (26) 

Proof. Let C„(0, e) be the feasible set of R-DB-RISK-SAA, and define {X, A) = {xi, A^, Vf G [n]}. 
Suppose (A, A) G Cn{9, a), where a > 0. Then for any (3>a we mnst have {X, A) G C„(d, (3) by the 
definition of the constraints in R-DB-RISK-SAA. This means that 


C„(d,ei)DC„(0,e2)3--- 


(27) 
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As a result, the set 'Dn{9,e„) = {6,X,A:6 € & and (A,A) gC„( 0,e^)} is also monotone nonincreas¬ 
ing: 

iPn(0,ei)DP„(0,e2)2--- (28) 


Also, the feasible set ^{u,9) is convex for fixed u,9 by A1 and has a nonempty interior by Rl. 


This means $(n,0) is continuous in 9 by Example 5.10 from (Rockafellar and Wets 1998), and so 


we can apply the Berge Maximum Theorem (Berge 1963) to FOP. This implies S{u,9) is upper 


hemicontinuous in 9 for fixed u€U. By Remark 3.2 of (Dempe et al. 2015), this means Qn(9} is 
lower semicontinuous. Thus, by Proposition 7.4.d of ( Rockafellar and Wets||1998 ) we have that the 
extended real-valued function {Qn{9; \ 9 £ 0} epiconverges to the extended real-valued function 


{Qn{9) I 9 £ 0}. The result then follows from Exercise 7.32.d and Theorem 7.33 of (Rockafellar 
and Wets||1998|). □ 


Corollary 3. Suppose A1,A2 and Rl hold. Given any d > 0, there exists E,Z > 0 such that if 
9„ £ z-argmin{Q„(0; e) | 0 G 0} for any 0 < z < Z and 0<e<E, then dist(0„, argmin{Q„(0) | 9 £ 

e})<d. 


Proof. This is a restatement of Proposition □ 

These results say that approximately solving R-IOP-SAA is equivalent to approximately solving 

lOP-SAA. 

3.3. Statistical Consistency 

In order to prove statistical consistency, we will need to impose an additional regularity condition 
that ensures expectations of corresponding random variables exist. 


R2. The set 0 is closed and bounded, and lE(y^) < -Foo. 


This regularity assumption ensures that the law of large numbers (Wald 1949, Jennrich 1969 


van der Vaart|2000 ) holds in our setting. The above expectation condition holds in many situations, 
including when y is bounded or when y has a sub-exponential distribution ( Vershynin|||2012 ). This 
allows for settings where IC holds with measurement noise that is Gaussian, Bernoulli, bounded 
support, Laplacian, Exponential, and many other distributions. 

Our first statistical consistency result is that solving R-IOP-SAA is risk consistent. To state the 
result, we must formally dehne the regularized version of the inverse optimization problem. The 
regularized risk is 


R RISK 


Q{9]€)=e( min ||?/-x||^), 

V x^S{u,9;e) / 
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where S{u, 9; e) = {a; G : /(x, u,6) <V {u, 6) + e, g{x, u, 9) < e} is the set of e-solutions to FOP. 
We define the regularized inverse optimization problem to be 

R-lOP min{(5(0;e) | 0G0}. 

The first statistical consistency result specifically concerns nearly-optimal solutions of R-IOP- 
SAA. We say that a sequence of solutions 9n is nearly-optimal for R-IOP-SAA with fixed e > 0 in 
probability if for any 5 > 0 we have 

lim P^dist(0„, argmin{Q„(0; e) | 0 G 0}) > <5^ = 0. (29) 

Theorem 2. Suppose A1,A2 and R1,R2 hold. Given any fixed e > 0, if 9n is nearly-optimal 
for R-IOP-SAA with high probability, then we have Q{9n;e) ^^min|Q(0;e) | 0G 0}. 

Proof. Propositiongives continuity of Qn{9;e). Thus, we can apply the uniform law of large 
numbers ( | Jennrich||1969[ ) , which gives 

supege |<3"(6';e)-Q(6';e)| ^0. (30) 

Consider any 9q G argmin{Q(0; e) | 0 G 0} and any 9i G argmin{Q„(0; e) | 9 G 0}. By assumption 
Qni9i;e) < Qn{9o]e), and so we have 

Qi9n] e) -F Qni9n] e) ~ Q{9n]^) + Qn{9i',e) — Qn{9n'i e) < Q(^oi f) + Qn{9o] e) — Q{9Q]e). (31) 


Rearranging terms gives 


Q{&n\ e) ~ Q{9o) < \Qn{9n\ c) ~ Qifin] e)| + \Qni9u ~ Qni^n'i c)! + |Qn(^oj e) ~ Qi&O] c)!' (32) 

Recall (i) Q{9o;e) < Q(0„;e) by definition of 9^, (ii) Q„(0;e) is continuous, and (hi) is nearly- 
optimal for R-IOP-SAA with high probability. Thus, combining these facts with po] ) and (32) gives 
that Q{9n',e) — Q{9o', e) 0. This is the desired result. □ 

This result says that if choose any e > 0 and solve R-IOP-SAA to generate an estimate 
then the predictions given by the e-solutions to FOP (i.e., 5(rt,0„;e)) are asymptotically the best 
possible set of predictions when the error of predictions is measured using R-RISK. A stronger risk 
consistency result is not possible in the general setting because Q{9) is typically discontinuous, and 
so the above result can be interpreted as a weak consistency result. 

A stronger risk consistency result is possible in the case where f{x,u,9) is strictly convex. We 
say that a sequence of solutions is nearly-optimal for lOP-SAA in probability if for any (5 > 0 
we have 

lim P^dist(0„, argmin{Q„(0) | 0 G 0}) ><5^ =0. (33) 
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Theorem 3. Suppose A1,A2 and R1,R2 hold. If f{x,u, 6) is strictly convex in x (for fixed u G 
U and 9^0) and 9^ is nearly-optimal for lOP-SAA with high probability, then we have Q{9n) 
min{Q(0) | 0 G 0}. 


Proof. Proposition gives continuity of Qn{9). The remainder of the proof is identical to 
Theorem [2j □ 

This result says that when FOP is a strictly convex optimization problem and we solve lOP-SAA 
to generate an estimate 0„, then the predictions given by the solutions to FOP (i.e., S{u,9n)) are 
asymptotically the best possible set of predictions when the error of predictions is measured using 
RISK. The reason it is possible to show risk consistency in this case is that Q{9) will be continuous 
in this setting. 

Our hnal statistical consistency result is that solving lOP-SAA is estimation consistent when IC 
holds. 


Theorem 4. Suppose A1,A2 and R1,R2 and IC hold. If is nearly-optimal for lOP-SAA 


with high probability, then we have 9„ 


■9n. 


Proof. Because the feasible set 4>(n, 9) is convex for hxed u, 9 by A1 and has a nonempty inte¬ 
rior by Rl, this means 4>(ri, 9) is continuous in 9 by Example 5.10 from ( Rockafellar and Wets|1998 ). 
Thus, we can apply the Berge Maximum Theorem (Berge 1963) to FOP. This implies S{u,9) is 


upper hemicontinuous in 9 for fixed u€U. By Remark 3.2 of Dempe et al. (2015), this means Q„(0) 
is lower semicontinuous. Thus, we can apply Theorem 5.14 of (van der Vaart 2000[ ). (Technically, 
this theorem applies to maximizing upper semicontinuous functions, but the results and proof triv¬ 
ially extend to the case of minimizing lower semicontinuous functions.) The result follows from the 
conclusion of Theorem 5.14 of ( van der Vaart||2000 ) if we can show (i) 9q G argmin{Q(0) | 9 G 0}, 
and that (ii) 9o is the unique solution. First, note Q{9) = E(mina;g 5 (u_e) ||^ — x|p)-F since 

f,,x is almost surely independent of w because by IC we have that (i) ^,u are independent of w, 
and (ii) S{u,9) is almost surely single-valued. Since by IC we have ^ £S{u,9o), this means that 
Q{9q) = E(tt;^) and that 9o G argmin{Q(0) | 9 G 0}. Next, consider any 9 £ Q\9o. Then by IC 
we have E[mina;g 5 („_ 6 i) 11^ — xP | ti G ^(0)] > 0 since ^ G 5(ri,0o) and dist(5(ti,0),5(ri,0o)) > 0 for 
each u £U{9). Because P(n G ll{9)) > 0 from IC, this means E(min 2 ,g 5 („_e) |p — xP) > 0 for any 
9 £Q\9o. Consequently, we have Q{9) > Q{9o) for any 0 G 0 \ 0o- ^ 


4. Numerical Approaches to Solving lOP—SAA 

Solving lOP-SAA with Qn{9) as formulated in DB-RISK-SAA is still difficult because it is a non- 
convex problem even under A1,A2,R1. We will propose two approaches to solving this problem. 
The first is an enumeration algorithm that is applicable to situations where p is modest (i.e., the 
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0 G parameter has between 1 to 5 dimensions). The second approach we describe is a semi- 
parametric algorithm, and it can be used in cases where 0 G is higher-dimensional and y has 
a specific distribution. For both algorithms, we will prove that the estimates computed by these 
methods satisfy the conditions required for statistical consistency. 

The difference in the two algorithms is how they trade-off computational and statistical per¬ 
formance. The enumeration algorithm requires exponential in p computation, while the semipara- 
metric algorithm needs polynomial in p computation. But the statistical performance of the meth¬ 
ods will be the opposite. The estimates and risk of the enumeration algorithm are anticipated 
to converge at faster rate than those of the semiparametric algorithm. The reason is that the 
semiparametric algorithm makes use of a nonparametric step (via the L2NW estimator), which 
is well-known to generally converge at a slower rate than a fully parametric approach. Precisely 
characterizing the statistical convergence rates of the two algorithms is left open for future work. 

Though the enumeration algorithm needs exponential in p computation, it is still practical for 
many real-world problems. Many principal-agent problems (e.g. ??) use models where the parame¬ 
ter set is modest in dimensionality (i.e., utility functions with 2 or 3 type parameters). We demon¬ 
strate the practicality of the enumeration algorithm in Section 5 through an energy-related example 
using real data. 

4.1. Enumeration Algorithm 

The main idea of this algorithm is that computing Qn{G) and Qn(0;e) for fixed values of 9 can 
be done in polynomial time since DB-RISK-SAA and R-DB-RISK-SAA are convex optimization 
problems by Propositions!^ andrespectively. This approach enumerates over different fixed values 
of 9 and solves a series of polynomial time problems. However, 0 is a continuous set since because 
it is convex by A2. To enable enumeration, we discretize 0 using a (5-net of 0, which we will call 
T((5). (Here, we define this to mean that T[5) is a finite set such that maxggemintg 7 -( 5 ) p — 0|| < 
(5.) We then compute Q„(0;e) for all 0 G T((5). And our approximate solution is finally given by 
(9„ = argmin{Q„(6';e) | 9^T{5)}. 

This approach requires continuity of Qn{0\() because otherwise performing an enumeration via 
the (5-net T[8) may not get sufficiently close to the optimal value. However, Qn{9;e) is only guar¬ 
anteed to be continuous at e = 0 when f{x,u,9) is strictly convex for fixed u,9 by Proposition 

and since Qn{9]0) = Qn{9) by definition. Hence, we require e > 0 for cases where f{x,u,9) is 
not strictly convex to ensure continuity of Q„(0;e) by Proposition]^ Of course, when f{x,u,9) is 
strictly convex we can set e = 0 and maintain continuity of Qn{9', e). 

This approach is formally presented in Algorithm Importantly, it can be shown that this enu¬ 
meration algorithm generates nearly-optimal solutions of lOP-SAA and R-IOP-SAA. This means 
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Algorithm 1: Enumeration Algorithm 
Data: fixed 5 > 0 and e > 0 
Result: estimate 

1 set T{6) to be (5-net of 0; 

2 foreach 6 £T{S) do 

3 |_ compute Q„(0;e) by solving R-DB-RISK-SAA; 

4 set G argmin{(5„(6*; e) | 0 gT((5)}; 


the statistical consistency results in Section [3i3| apply to the solutions computed by this algorithm. 
In practice, e is chosen to be e = 0 when FOP is strictly convex, and otherwise e is chosen to be a 
small positive value that controls the desired precision of the resulting estimate. 


Theorem 5. Suppose A1,A2 and R1 hold. Given any d> 0, there exists E,A>0 such that ifOn 
is computed using the enumeration algorithm (i.e., Algorithm^ for any 0 < e < i? and 0 < (5 < A, 
then dist(0„, argmin{(5„(0) | 0G0})<d. 

Proof. By Corollary]^ there exists E,Z > 0 such that if G z-argmin{(5„(0;e) | 0 G 0} for 
any 0 < z < Z and 0 < e < E, then dist(0„, argmin{Q„(0) | 6 G 0}) < d. Suppose we choose z = Z. 
Because Quid', e) is continuous in 6 by Proposition]^ there exists A > 0 such that for any 0 < (5 < A 
we have 

min{(5„(6»;e)-(5„(6'o;e) | 6 'g7'(5)} < z, (34) 


where 9q G argmin{(5„(0; e) | 6 G 0}. By construction, we have 


argmin{Q„(0; e) | 0 G T((5)} C z-argmin{Q„(0; e) | 0G0}. (35) 

Next, note the enumeration algorithm returns a solution G argmin{Q„(0; e) | 6 G T((5)}, which 
also satisfies 9n G z-argmin{(5n(0; e) | 9 G 0}. The result follows from applying the first line of the 
proof. □ 

As mentioned above, in the special case where FOP is a strictly convex optimization problem we 
can simplify the algorithm by setting e = 0. We have a corresponding result about the correctness 
of the algorithm in this case. 

Theorem 6. Suppose A1,A2 and R1 hold. If fix,u,9) is strictly convex in x (for fixed u G 
U and 9 G Q), then given any d > 0 there exists A > 0 such that if 9n is computed using the 
enumeration algorithm for e = 0 and any 0 < (5 < A, then dist(0„, argmin{Q„(d) | 9 G 0}) < d. 

Proof. By Corollary]^ there exists E,Z > Q such that if G z- argmin{Qra(d; e) | d G 0} for 
any 0 < z < Z and 0 < e < E, then dist(d„, argmin{Q„(d) | 9 G 0}) < d. Suppose we choose z = Z 
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and e = 0, and note that <5„(0; 0) = Q„(0) by their definitions. Because Q„(0) is continuous in 9 by 
Proposition there exists A > 0 such that for any 0 < (5 < A we have 

- Qr,{6o) \ 9 £T{S)} < z, (36) 

where 9 q G argmin{(5„(0) | 9 G 0}. By construction, we have 

argmin{Q„(0) | 0 G T(5)} C z- argmin{Q„(0) | 0G0}. (37) 

Next, note the enumeration algorithm returns a solution G argmin{Q„(0) | 9 G T{S)}, which 
also satisfies G z-argmin{Q„(0) | 9 G 0}. The result follows from the first line of the proof. □ 

4.2. Semiparametric Approach 

0ur second approach to solving lOP-SAA is a semiparametric approach. We will need to make an 
additional assumption about the structure of the problem, as well as impose two more regularity 
conditions, in order to be able use this approach. We begin with the additional assumption. 


A3. The constraint function g{x, u, 9) is independent of 9, meaning it can be written as g{x, u, 9) = 
go{x,u). The objective function f{x,u,9) is affine in 9, meaning it can be written as 

p 

f{x,u,9) = fo{x,u) + '^{9)Jj{x,u). (38) 

i=i 

Independence of the constraint g from 9 is required because the semiparametric approach relies on 
fully knowing the feasible region of the forward problem. We note that this is not a particularly 
strong assumption, since in utility estimation settings one would expect the unknown parameters 


to appear in the objective function of the forward problem. Keshavarz et ah (2011) and Bertsimas 


et al. (2014) also assume that the feasible region of the forward problem is independent of the 


unknown parameters. The second part of A3 ensures that the Lagrangian dual function h{X,u,9) 
is concave in 9. This will enable efficient computation in our semiparametric approach. Next, we 
describe the two additional regularity conditions. The first is 


R3. The objective function f{x, u, 9) is strictly convex in x (for fixed uGU and 0 G 0) and twice 
continuously differentiable in x,u,9, and the constraints g{x,u,9) are continuously differentiable 
in X, u, 9. 


Condition R3 ensures smoothness in the objective function and constraints. The reason we also 
include a strict convexity assumption is that it acts a regularity condition: Strictly speaking, we 
require the second-order growth condition 


f{x, u,9) >V{u, 9) + c - [dist(a;, 5(n, 0))]^, 


(39) 
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for some c > 0 and all x G ^{u,9). Unfortunately, this condition can be difficult to check even 


though it has been completely characterized for convex optimization problems (Bonnans and Ioffe 


1995). Fortunately, strict convexity with constraint qualihcation implies this second-order growth 


condition (Bonnans 1992). Also, note that our results could be extended to the case where the 


problem satisfies the first-order growth condition 


f{x, u,9) >V{u, 9) + c - dist{x, S{u, 9)), 


(40) 


for some c > 0 and all x G ^{u,9). We do not consider this extension in the present paper. 

R4. The noise random variable w has a sub-exponential distribution, meaning there exists c > 0 
such that P(|r/;| >t)< exp(l — tjc). Also, the probability density function fi{u) of u is continuously 
differentiable and is bounded from zero (i.e., min„g;^ ^{u) > 0 ). 


This regularity condition ensures the distribution of the random variables w,u are not extreme. 
Most commonly used heavy-tailed noise distributions are sub-exponential distributions, and so 
R4 is satished by Gaussian, Bernoulli, bounded support, Laplacian, Exponential, and many other 


distributions (Vershynin 2012). Also, the regularity condition on ^{u) implies U is bounded. 


The idea behind the semiparametric approach is the observation that R-DB-RISK-SAA is convex 
in 9 for fixed x when A3 holds. However, because the yi are measured with noise, we cannot 
simply make the substitution Xi = yi- To overcome this difficulty, we first de-noise the yi using 
a nonparametric estimator. Specifically, we define the 1 ^ 2 -regularized Nadaraya-Watson (L2NW) 


estimator (Aswani et al. 2013) as 




Xi = ■ 


(41) 


where 7 > 0 is the bandwidth parameter, u > 0 is the i 2 -'^^ 9 ularization parameter, and K : M"* —>■ M is 
a kernel function that satisfies the following properties (i) K{u) > 0, (ii) K{u) = 0 for ||n|| > 1, (hi) 
K{u) = K{—u), and (iv) f K{u)du = 1. A common example of a kernel function is the Epanechnikov 
kernel, which is defined as the function 


0 , otherwise 


(42) 


The L2NW estimator (41) is computed in polynomial time, and it serves to de-noise the Xi in the 


manner described by the following proposition. 
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Proposition 9. Suppose A1 andRl-R4 hold. If^ = 0{n ^/(sm+i)^ and a = 0{'y), then S{u,9) 
consists of a single point, and for sufficiently large n we have we have 


^^max ||xi — 0 o)|| > n < ki exp ^ , 


(43) 


where ki,k 2 > 0 are constants. In particular, this implies maxig[„] ||xi — 5(ttj,0o)|| 0. 

Proof. The first part follows from the strict convexity assumption in R2, and the third part 
follows directly from the second part. And so we focus on proving the second part. We will prove 


this using a truncation argument (see for instance (Tao 2012)). 

First, note that the function f{x,y) = x/y over the domain {x,y) G [—M, M] x [a,a + 1] 
is Lipschitz continuous with constant Li = yj (M^ + (u + lY)/a'^. Suppose we choose M = 
max„gi^ ||/i('u)5(u, 0o)|| + 1- As a result, using Lemmaand Lemmawe have 


— _ ti(ui)S{ui,eo) II . 

<t+mK) II ^ '' 


<^+ “*(||7-’"• hEU % • II >^) + 

EE]=iy 3 ■ -iJ.{u,)s{u„eo)\\ >t/Li) (44) 

< 2 exp ^ - 2 c 2 nE"^ ■ {t/Li - Ci + 2exp - 2 c 2 n'f^ • (1 - Ci - 7 )^^ + 

2 exp - 2c5nE"" ■ if/ Li - C 3 • 7 ^^ - C 4 • 7 )), 

for t > max{ci • 7 , C 3 • 7 ^^ + 04-7}. Next, observe that the function if>{x, y) over the domain 

{x,y) G [min fj.{u)S{u, 9), max fj.{u)S{u, 9)] x [min^(u),max/r(u)], (45) 

u^U uGU uGU u£U 

is Lipschitz continuous with some constant L 2 > 0 since (i) the denominator of if is bounded away 
from zero because of R4, and (ii) the numerator of is bounded by R1,R4. Thus, we have 


“(||x,-5('U„0o)|| >i) < p(||x, - II > t - (T/L 2 ) : 


(46) 


for t > oIL 2 . Suppose we choose 7 = 0{n 2 /( 8 m+i)^^ ^ _ 0{E), and t = n i/(i 6 ™+ 2 )^ Then combining 

(47) 


(44) and (46) gives that for sufficiently large n we have 

-5(n,,0o)|| >n"PP®'"+P^ < Cg exp , 


(48) 


where CgjCy > 0 are constants. And so combining the union bound with (47) gives 

P(^max||x,-5(M„0o)|| >n-PP®’"+P) < nP(|||x, - 5(u„ 0o)|| >n-PP®™+ 2 ) 

< Cg exp ^ + log . 

The final implication of the result follows by noting that —)• 0 and Cgexp(—+ 

logn) —> 0 as n —> 00 . □ 
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Before we present our algorithm, we need one more result that provides additional understanding 
for the semiparametric approach. Consider the following optimization problem 


ROBUST-IOP-SAA 


minm^x 


0G0}, 


Proposition 10. Suppose A1,A2 and HI hold. Then the solution sets in 6 o/ROBUST-IOP- 
SAA and lOP-SAA are equivalent, and the optimal value o/ROBUST-IOP-SAA occurs at e = 0. 

Proof. Let C„(9,e) be the feasible set of R-DB-RISK-SAA. As shown in the proof for Proposi¬ 
tion [8j the feasible set satisfies 

C„(0,O)CC„(0,e), (49) 

for all e > 0. As a result, we must have that Q„(0;O) > Qn{d;e) for all e > 0. This means that 
maXe>oQ„(0; e) = Qn{0;0). The result holds because Qn{0;0) =Qn{9) by definition. □ 

Given the above relationship that the optimal value of ROBUST-IOP-SAA occurs at e = 0, we 
propose to solve the inverse optimization problem using the following formulation: 


SP lOP RISK SAA 


1 " 

On G arg min — N 
n 

i=l 


s.t. f{x„u,,9) - h{Xi, Ui,6)<ei, Vi G [n] 
\i >0, Vi G [n] 


where the Xi are as defined in (41). This is a convex optimization problem. 

Proposition 11. Suppose A1-A3 and R1 hold. Then SP-IOP-RISK-SAA is a convex opti¬ 
mization problem. 


Proof. Since the Lagrangian dual function is defined as 

h{X,u,9) = inf (/o(a;, r) + '*^) + Ei=i('^)i(Vo(a;,, 


(50) 


it is the pointwise infimum of a family of affine functions of (A,0) and hence concave in (A,0) 


(Boyd and Vandenberghe 2009). Since f{x,u,9) is affine in 9 by A3, this means the constraint 
f{fXi,Ui,9) — h{Xi,Ui,9) < €i is convex in 9,ei. The objective function is linear, and so the entire 
optimization problem is convex. □ 

We now have the elements to construct our semiparametric algorithm, which is a two-step 
approach. In the first step, we de-noise the Pi data using the L2NW estimator given in ( [4T| ). And 
the second step is to solve SP-IOP-RISK-SAA. This approach is formally presented in Algorithmj^ 
Importantly, it can be shown that this semiparametric algorithm generates nearly-optimal solutions 


of lOP-SAA. This means the statistical consistency results in Section apply to the solutions 
computed by this algorithm. In practice, the values of 7, a can be chosen using standard approaches 


from statistics like cross-validation (Hastie et al. 2009). 
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Algorithm 2: Semiparametric Algorithm 


Data: fixed 7 > 0 and cr > 0 
Result: estimate 

1 foreach i G [n] do 

2 1^ compute Xi using using (41); 

3 compute On using SP-IOP-RISK-SAA; 


Theorem 7. Suppose and R1-R4 and IC hold. Ifa = 0{n ^/(sm+i)^^ ^ = and 

On is computed using the semiparametric algorithm (i.e., Algorithm^^ ; then On is nearly-optimal 
for lOP-SAA in probability. 

Proof. Note that min{—fi(A,n,d) |A > 0} = —f{S{u,0),u,0) by strong duality (which holds 


because of A1,R1 (Bonnans and Shapiro 2000)). Next, consider the function 

R{O)=E(^mmf{S{u,0o),u,O)-h{\,u,O)^ =E(^f{S{u,0o),u,0) - f{S{u,0),u,0)y (51) 

its sample average approximation 

Rn{0) = “X] (min/(5(ni,6'o),«o6') - fi(Ai,'Ui,6')^ ^ n ^ {^f{‘S{ui,Oo),Ui,0) - f{S{u,,O),Ui,0)^ , 

i—1 ~ 1=1 

(52) 

and its semiparametric approximation 

- 71 1 

Rn{0) = - ^ (mmf{x,,u„0) - h{\i,u„0) \ = - ^ (f{xi,Ui,0) - f{S{u„0),Ui,0)). (53) 

ITj \ / 7^ V / 

i—1 1=1 

Note that min{i?„(d) | d G 0} is simply a reformulation of SP-IOP-RISK-SAA. Next, observe that 
E[f{S{u,Oo),u,O) — f{S{u,O),u,O)\u^l{{0)] >0 since (i) f{x,u,0) is twice continuously differen¬ 
tiable in X by R3, and (ii) dist(5(n,d),5(n,do)) > 0 for each u ^U{0) by IC. Consequently, we 
have R{0) > 0 for d G 0 \ do- As shown in the proof for Proposition]^ S{u,0) is continuous in d. 
And so Rn{0) and Rn{0) are continuous because (i) f{x,u,0) is twice continuously differentiable 
in x,d by R3. 

Next, recall that U is bounded by R4, 0 is bounded by R2, f{x,u,0) is twice continuously 
differentiable in x,0 by R3, and the feasible set of FOP is absolutely bounded by Rl. This 
means there exists L>0 such that for all d G 0 we have max^gj^] |/(x, n^d) — f{S{ui,0o),Ui,0)\ < 
Ln~^AiSm) -^j^enever maxig[„] \\xi — 5(Mi,do)|| < (which occurs with probability at least 

1 — ki exp(—A: 2 n^/'‘) by Proposition]^. Thus, we have that sup^gg |i?„(d) — d?„(d)| -A). 0. Now con¬ 
sider any d„ G argmin{i?„(d) | 0 G 0}, and note that the estimate d„ returned by the semiparametric 
algorithm satisfies this property by construction. By definition we have i?„(d„) < i?„(do), which 
can be rewritten as 

Rn{0n) + Rn{0n) ~ Rn{0n) < Rn{0o) + Rn{Oo) ~ Rn{0o)- 


( 54 ) 
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Thus, we have 


Ryiifin) < Rn{(^o) + \Rn{Gn) ~ Rn{Qn)\ + |-Rn(^o) ~ -Rn(^o)|- 


(55) 


We have thus shown all the conditions required to apply Theorem 5.14 of (van der Vaart 2000), 
which gives 9q. Now let G argmin{Q„(0) | 0 G 0}. By Theorem]^ we have Oq- This 


means that \6n — 9n 


• 0 . □ 


5. Numerical Experiments 

We present numerical results that demonstrate the statistical consistency of our algorithms for 
inverse optimization with noisy data, and the results show our algorithms perform competitively 
against KKA (Keshavarz et al. 2011) and VIA ( ]Bertsimas et al. 2014). We begin by conducting 
two types of tests using synthetic data. The first type is where the model is kept fixed and the 
number of data points increases, and the purpose is to display either estimation consistency or 
risk consistency of our algorithms. The second type is where the number of data points is kept 
fixed and the number of the parameters in the model increases, and the purpose is to display the 
feasibility of using our algorithms on large problems and to show the effect of data sampling and 
model complexity on statistical performance of our algorithms. Next, we apply our framework to 
a real data set in order to estimate a utility function that describes the tradeoff made between 
occupant comfort and the amount of energy consumption, when setting a thermostat temperature 
setpoint for air-conditioning. 


5.1. Synthetic Data and Enumeration Algorithm 

In the first experiments, we generate data using a given FOP and then use the same set of equations 
in SAA-IOP. In other words, the first set of experiments are situations where the model whose 
parameters are being identified exactly matches the model that generates the data. As a result, 
this setting consists of situations where IC is satisfied. The first example is where: (i) FOP-A 
is min{(0 + ri) • x | a; G [—1,1]}, (ii) u has a uniform distribution with support [—1,1], (iii) the 
measurement noise w has a normal distribution with zero mean and unit variance, (iv) the data is 
generated with 9q = 1, and (v) the enumeration algorithm (i.e., Algorithm was applied with e = 
0.001, S = 0.01, and 0 = [—1,1]. The second example is where: (i) FOP-B is min{x^ — (9-l-u) -x | x G 
[ 0 , 1 ]}, (ii) u has a uniform distribution with support [ 0 , 2 ], (iii) the measurement noise w has a 
normal distribution with zero mean and unit variance, (iv) the data is generated with 9o = and 
(v) the enumeration algorithm (i.e.. Algorithmwas applied with e = 0, d = 0.01, and 0 = [0,2 ]. 

The results averaged over 100 repetitions of sampling n G {10,30,50,100,300,500,1000} data 
points and then estimating the parameter 9 are summarized in Table We label the enumeration 
algorithm (i.e., Algorithm as ENA in the table. These results display estimation consistency 
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Table 1 Estimation Error \9n — 0Q\ Using Different Algorithms 


n 10 30 50 100 300 500 1000 

Data: FOP-A 
Model: FOP-A 

ENA 0.2616 0.0926 0.0380 0.0211 0.0055 0.0030 0.0009 

KKA 0.8686 0.8293 0.8182 0.8257 0.8130 0.8231 0.8170 

VIA 0.5552 0.4976 0.4829 0.4887 0.4807 0.4846 0.4780 

Data: FOP-B 
Model: FOP-B 

ENA 0.4577 0.2481 0.1510 0.0501 0.0222 0.0123 0.0063 

KKA 0.5065 0.2281 0.1595 0.0751 0.0398 0.0342 0.0238 

VIA 0.9488 0.7051 0.6344 0.4284 0.3145 0.3810 0.2962 



e, 

(a) ENA 



e, 

(b) KKA 



e, 

(c) VIA 


Figure 1 Scatter plot comparing estimated parameter 6n versus true parameter 9o as computed by different 
algorithms at n = 1,000 when the data and model are both FOP-A. 



1 

do 


(a) ENA 



(b) KKA 



e, 

(c) VIA 



Oo 

(d) SPA 


Figure 2 Scatter plot comparing estimated parameter On versus true parameter 6o as computed by different 
algorithms at n = 10,000 when the data and model are both FOP-B. 


of the enumeration algorithm since estimation error is decreasing to zero. To further illustrate 
estimation consistency, we conducted an experiment with the two examples above where the data 
was generated with a Oo that was randomly chosen from a uniform distribution with support [—1,1] 
and [0,2] for the first and second examples, respectively. A plot comparing the estimates to the 
true parameter 9q for the first situation when n = 1,000 is shown in Figure and a plot comparing 
the estimates 9n to the true parameter 9o for the second situation when n = 10,000 is shown in 
Figure Consistent estimates should line up along the diagonal, and hence these plots demonstrate 
the estimation consistency (inconsistency) of the enumeration algorithm (KKA and VIA). 
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Table 2 Normalized Prediction Error Q{0n) — \iar{w) Using Different Algorithms 



n 

10 

30 

50 

100 

300 

500 

1000 

Data: 

Model: 

FOP-C 

FOP-B 

ENA 

0.0216 

0.0184 

0.0162 

0.0150 

0.0065 

0.0046 

0.0017 

KKA 

0.0168 

0.0124 

0.0128 

0.0151 

0.0150 

0.0150 

0.0132 

VIA 

0.0249 

0.0185 

0.0196 

0.0149 

0.0089 

0.0072 

0.0042 

Data: 

Model: 

SQR-1 

FOP-B 

ENA 

0.0294 

0.0217 

0.0152 

0.0110 

0.0073 

0.0041 

0.0024 

KKA 

0.0394 

0.0389 

0.0398 

0.0440 

0.0504 

0.0525 

0.0518 

VIA 

0.0343 

0.0287 

0.0243 

0.0187 

0.0122 

0.0084 

0.0072 


In the second set of experiments, we generate data using a given model that is different than 
the FOP used to formulate SAA-IOP. In other words, this set of experiments are situations where 
the model whose parameters are being identified does not match the model that generates the 
data. As a result, this setting consists of situations where IC is not satisfied. The first example 
is where: (i) the data is generated by FOP-C which is min{| ■ x'^ — (l + u) ■ x \ x G [0,1]}, (ii) the 
model estimated by lOP-SAA is FOP-B, (iii) u has a uniform distribution with support [0,5], (iv) 
the measurement noise w has a normal distribution with zero mean and unit variance, and (v) 
the enumeration algorithm (i.e.. Algorithm was applied with e = 0, S = 0.01, and 0 = [0,2]. 
The second example is where: (i) the data is generated by the statistical model SQR-1 given by 
Hi = min{max{y^, 0}, 1} TrCi, (ii) the model estimated by lOP-SAA is FOP-B, (iii) u has a uniform 
distribution with support [0,5], (iv) the measurement noise w has a normal distribution with zero 
mean and unit variance, and (v) the enumeration algorithm (i.e., Algorithmic was applied with 
e = 0, 5 = 0.01, and 0 = [0,2]. 

The results averaged over 100 repetitions of sampling n G {10,30,50,100,300,500,1000} data 
points and then estimating the parameter 9 are summarized in Table and these results are 
normalized by subtracting veii{w). The reason for this normalization is that the prediction error 
E((y — ,^(n))^)) of the prediction ^(n) of the true model (either FOP-C or SQR-M, respectively) is 
var{w) because y = ^{u) + w here. The enumeration algorithm has lower prediction error because 
it is risk consistent, whereas KKA and VIA are not risk consistent. 

5.2. Synthetic Data and Semiparametric Algorithm 

Next, we generate data using a given FOP and then use the same equations in SAA-IOP. These 
experiments are situations where the model whose parameters are being identified exactly matches 
the model that generates the data. As a result, this setting consists of situations where IC is 
satisfied. The first example is where: (i) FOP-D is min{x'x — (0-F u)'x | xG [0,1]^}, (ii) u has a 
uniform distribution with support [0,2]^, (iii) the measurement noise w has a jointly Gaussian 
distribution with zero mean and identity covariance, (iv) the data is generated with p = 10 and 
9o G such that {9o)k = | for all k€[p], and (v) the semiparametric algorithm (i.e.. Algorithmic 
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was applied with 7 ,cr chosen using cross-validation (Hastie et al. 2009) and 0 = [0,2], The second 
example is where: (i) FOP-E is 


min{- ELi('^)fc • log((^)fe + {'>Ak) - log((x)p+i + {u)p+i) \ {x)k > 0 , ELi(^)fc = (^ 6 ) 

(ii) u has a uniform distribution with support [ 1 , 2 ]^'*'^, (iii) the measurement noise w has a jointly 
Gaussian distribution with zero mean and identity covariance, (iv) the data is generated with p = 10 
and 9o G such that {9o)k = 1 for all k £ [p], and (v) a modified version of the seimparametric 
algorithm (i.e., Algorithm]^ was applied with 7,(7 chosen using cross-validation ( Hastie et al.[2009 ) 
and 0 = [|, 2 ], The modification to Algorithmis we calculate Xi = mina;{||xi — x|| | {x)k > 0} and 
then compute using SP-IOP-RISK-SAA with the x^ replacing the x^. The Xi are the projection of 
the Xi onto the nonnegative orthant, and it turns out this projection does not affect our theoretical 
results. In particular, a short proof using the continuous mapping theorem ( van der Vaart|2000 ) and 
the boundedness of the feasible set in R1 gives that maxig[„] ||xi — 5(tij,0o)|| 0- The projection 

is needed for this particular example because otherwise we would have logarithms of negative 
numbers, which are complex-valued. More generally, a projection of Xi onto the feasible set of FOP 
will not affect our theoretical results, and can be added as a step in our semiparametric algorithm. 

The results averaged over 100 repetitions of sampling n G {10,30,50,100,300,500,1000} data 
points and then estimating the parameter 9 are summarized in Table|^ We label the semiparametric 
algorithm (i.e.. Algorithm]^ as SPA in the table. These results display estimation consistency of 
the semiparametric algorithm since it has lower estimation error as the data increases. To further 
illustrate estimation consistency, we conducted an experiment with the two situations above where 
the data was generated with p = l and a 9^ that was randomly chosen from a uniform distribution 
with support [0,1] and [|,2] for the first and second situations, respectively. A plot comparing the 
estimates 9^ to the true parameter 9q for the first situation when n = 1,000 is shown in Figure and 
a plot comparing the estimates to the true parameter 9q for the second situation when n = 1,000 
is shown in Figure Consistent estimates should line up along the diagonal, and hence these plots 
demonstrate the estimation consistency (inconsistency) of the semiparametric algorithm (KKA 
and VIA). It is worth comparing the results of the semiparametric and enumeration algorithms. 
As mentioned above, the semiparametric algorithm will generally have higher estimation error - 
this can be observed in these plots because the semiparametric algorithm estimates have a larger 
variation about the diagonal than the estimates of the enumeration algorithm. 

In the second set of experiments, we generate data using a given model that is different than 
the FOP used to formulate SAA-IOP. In other words, this set of experiments are situations where 
the model whose parameters are being identified does not match the model that generates the 
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Table 3 Estimation Error | —Using Different Algorithms 



n 

10 

30 

50 

100 

300 

500 

1000 

Data: 

Model: 

FOP-D 

FOP-D 

SPA 

2.4618 

1.7025 

1.2543 

0.8535 

0.4754 

0.3750 

0.2573 

KKA 

2.2569 

1.5513 

1.2229 

0.9281 

0.6107 

0.5435 

0.4447 

VIA 

3.3829 

3.2603 

3.1937 

3.1501 

3.0292 

3.0324 

2.9208 

Data: 

Model: 

FOP-E 

FOP-E 

SPA 

0.9189 

0.7982 

0.7500 

0.7487 

0.6639 

0.6070 

0.5783 

KKA 

1.6687 

1.5850 

1.5813 

1.5865 

1.5828 

1.5806 

1.5811 

VIA 

1.9299 

1.6781 

1.6826 

1.6132 

1.6001 

1.5973 

1.5843 




% Oo 

(a) ENA (b) KKA 

Figure 3 Scatter plot comparing estimated parameter 




% Oo 

(c) VIA (d) SPA 

versus true parameter 6q as computed by different 


algorithms at n = 1,000 when the data and model are both FOP-E with 1. 


data. As a result, this setting consists of situations where IC is not satisfied. The first example 
is where: (i) the data is generated by FOP-F which is min{| • x'x' — (1 + u)'x \ x G [0,1]^°}, (ii) 
the model estimated by lOP-SAA is FOP-D with p = 10, (hi) u has a uniform distribution with 
support [0,5]^°, (iv) the measurement noise w has a jointly Gaussian distribution with zero mean 
and identity covariance, and (v) the semiparametric algorithm (i.e.. Algorithm]^ was applied with 
7 , a chosen using cross-validation ( Hastie et al.||2009 ) and 0 = [0,2]. The second example is where: 

(i) the data is generated by the statistical model SQR-P given by = min{max{y^, 0 }, 1 } -FrCi, 

(ii) the model estimated by lOP-SAA is FOP-D with p = 10, (hi) u has a uniform distribution with 
support [0,5]^°, (iv) the measurement noise w has a jointly Gaussian distribution with zero mean 
and identity covariance, and (v) the semiparametric algorithm (i.e.. Algorithm]^ was applied with 
7 ,cr chosen using cross-validation (Hastie et al. 2009) and 0 = [0,2]. 

The results averaged over 100 repetitions of sampling n G {10,30,50,100,300,500,1000} data 
points and then estimating the parameter 6 are summarized in Table and these results are 
normalized by subtracting K{w'w). The reason for this normalization is that the prediction error 
E(||y — ^(n)|p)) of the prediction ^{u) of the true model (either FOP-C or SQR-M, respectively) is 
K{w'w) because y = ^{u) + w here. The enumeration algorithm has lower prediction error because 
it is risk consistent, whereas KKA and VIA are not risk consistent. 

In the third set of experiments, we generate data using the previous four settings. The difference 
in this set of experiments is that we fix n = 1000 and vary p G {1,3,5,10,30}. The results when 
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Table 4 Normalized Prediction Error Q{9n) —¥,{w'w) Using Different Algorithms 



n 

10 

30 

50 

100 

300 

500 

1000 

Data: 

Model: 

FOP-C 

FOP-D 

SPA 

0.2319 

0.1972 

0.1744 

0.1501 

0.1029 

0.0844 

0.0529 

KKA 

0.1584 

0.1308 

0.1314 

0.1349 

0.1452 

0.1497 

0.1481 

VIA 

0.3438 

0.3407 

0.3360 

0.3205 

0.2950 

0.2816 

0.2811 

Data: 

Model: 

SQR-M 

FOP-D 

SPA 

0.4180 

0.3497 

0.3195 

0.2470 

0.1572 

0.0998 

0.0658 

KKA 

0.3645 

0.3885 

0.3987 

0.4537 

0.5115 

0.5114 

0.5214 

VIA 

0.3468 

0.2784 

0.2737 

0.2524 

0.2405 

0.2458 

0.2599 


Table 5 Estimation Error \\6n — 6o\\ Using Different Algorithms 



P 

1 

3 

5 

10 

30 

Data: 

Model: 

FOP-D 

FOP-D 

SPA 

0.0601 

0.1464 

0.1907 

0.2794 

0.4701 

KKA 

0.1178 

0.2349 

0.3038 

0.4619 

0.7978 

VIA 

0.4943 

1.2254 

1.8099 

2.9522 

5.7737 

Data: 

Model: 

FOP-E 

FOP-E 

SPA 

0.0251 

0.1258 

0.2571 

0.5890 

0.5576 

KKA 

0.5000 

0.8660 

1.1174 

1.5804 

2.7377 

VIA 

0.5000 

0.8691 

1.1231 

1.5966 

2.7628 


Table 6 Normalized Prediction Error Q(0„) — E(w'ui) Using Different Algorithms 



P 

1 

3 

5 

10 

30 

Data: 

Model: 

FOP-C 

FOP-D 

SPA 

0.0064 

0.0171 

0.0403 

0.0628 

0.2048 

KKA 

0.0538 

0.1553 

0.2619 

0.5252 

1.5712 

VIA 

0.0078 

0.0175 

0.0745 

0.2602 

0.9654 

Data: 

Model: 

SQR-M 

FOP-D 

SPA 

0.0056 

0.0194 

0.0319 

0.0606 

0.1568 

KKA 

0.0148 

0.0471 

0.0761 

0.1523 

0.4394 

VIA 

0.0055 

0.0273 

0.0821 

0.2848 

1.2896 


the data/model are given by FOP-D/FOP-D and FOP-E/FOP-E, averaged over 100 repetitions 
and then estimating the parameter 0, are snmmarized in Table These results show that the 
semiparametric algorithm has lower estimation error than KKA and VIA on these examples. The 
results when the data/model are given by FOP-C/FOP-B and SQR-M/FOP-B, averaged over 100 
repetitions and then estimating the parameter 9, are summarized in Table These resnlts show 
that the semiparametric algorithm has lower prediction error than KKA and VIA on these examples. 


5.3. Empirical Data: Estimating an Energy-Comfort Utility Function 

We next apply our inverse optimization framework to the problem of estimating a utility function 
that describes the tradeoff made between occupant comfort and the amount of energy consumption, 
when setting a thermostat temperature setpoint for air-conditioning. The data we use is collected 
from Sutardja Dai Hall on the Berkeley campns, which was used as part of the BRITE-S testbed 
in our past experiments (Aswani et al. 2012a||b||c) concerning robust learning-based optimization 


(Aswani et al. 2013) of heating, ventilation, and air-conditioning (HVAC) systems. Specifically, 


this building is eqnipped with a commercial web application (Bnilding Robotics 


2016) that allows 
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Table 7 Prediction Error Q{9n) Using Different Algorithms 



n 

10 

30 

50 

100 

300 

500 

1000 

Data: 

Model: 

SDH-E 

FOP-S 

ENA 

1.3656 

1.3308 

1.3255 

1.3169 

1.3112 

1.3099 

1.3090 

KKA 

2.2439 

2.2528 

2.2508 

2.2351 

2.2225 

2.2220 

2.2200 

VIA 

2.2975 

2.2538 

2.2472 

2.2277 

2.2163 

2.2166 

2.2138 


occupants to change the thermostat temperature setpoints in real-time, and so the setpoints are 
changed throughout the year by occupants in response to factors like the outside weather. 

When a room is being cooled, a lower temperature setpoint requires increased energy consump¬ 
tion since the air-conditioner must provide more cold air; however, the purpose of air-conditioning 
is to improve comfort by lowering the room temperature. And so individuals must tradeoff comfort 
and energy consumption when choosing the setpoint. A simplified utility function model (expressed 
as minimization of the negative of the utility function) that captures this tradeoff is FOP-S: 


min{(0)i • {x — 76)^ + {x — { 6)2 — u)^ \ x G [70, 76]}, 


(57) 


where x G M is the thermostat temperature setpoint in units of degrees Fahrenheit (°F), and u G M 
is the current outside temperaure in degrees Fahrenheit (°F). The term (x — { 6)2 — uY indicates a 
preference for a temperature setpoint that is a fixed amount { 6)2 above the outside temperature u 
(i.e., the preferred temperature is { 6)2 + ^i), and the reason for this term is that individuals prefer 


a higher indoor temperature as the outside temperature increases (ASHRAE 2013). The term 
(0)i • (x — 76)^ indicates a preference for a higher setpoint because of energy considerations, and 
the number 76 is used because 76°F-78°F is a relatively high setpoint temperature that is often 
recommended for saving energy. The parameter { 6)1 quantihes the tradeoff between the preference 
for a higher setpoint to save energy versus the desired indoor temperature { 6)2 -F u. Lastly, the 
constraints x G [70, 76] indicate observed setpoint limits. 

The results averaged over 100 repetitions of sampling n G {10,30,50,100,300,500,1000} data 
points and then estimating the parameters 9 are summarized in Table The data set (which we 
label SDH-E in the table) used consists of outside temperature measurements (i.e., the u variable) 
and the chosen temperature set point (i.e., the x variable) of a single thermostat in Sutardja Dai 
Hall. In each repetition, the full data set was randomly split into a 1,000 point training data set 
and a 14,500 point testing data set. The n data points were randomly chosen from the training 
data set, and the prediction error of the estimated parameters were computed using the testing 
data set. To evaluate the statistical significance of the computed results, a bootstrap hypothesis 


test (Efron and Tibshirani 1994) was conducted. The computed p-value was less than 0.01, which 


indicates that the improved performance of the enumeration algorithm is statistically significant. 
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6. Conclusion 

We developed and analyzed a formulation for inverse optimization in the setting where noisy 
measurements of the optimal points of a convex optimization problem are available. Our approach 
requires solving a bilevel program, and we defined a new duality-based reformulation to convert 
this bilevel program into a single level program. We showed that existing heuristics for solving 
the problem of inverse optimization with noisy data are statistically inconsistent, whereas our 
formulation as a bilevel program leads to statistical consistency. Though our formulation is NP-hard 
to solve, we provided two numerical algorithms for solving our formulation and then demonstrated 
the improved estimates generated by our approaches through a series of numerical experiments 
involving both synthetic and empirical data. 
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Appendix 

A. Lemmas 

Lemma 1. Suppose R4 holds. Then for t> ci ■ j we have 

>i) < 2 exp (^- 2027172 ™ • (i-ci - 7 ) 2 ^, 

where Ci, C 2 > 0 are constants. 


(58) 


Proof. Recall fj.{u) is the probability density function of u, and note that 

Itt,] I = \Ku.) - 7 -"* /*„ K{^)f,{u)du\ 

= \T{ui)-T~’" lR^K{s)fj,{u,+js)j’^ds\ 

= \p.{u,) - K{s){fi{ui) + jWp{ui + /3'ys)'^s)ds\ (59) 

= I /»™ ^(s)V/r(u, + /37s)2’sds| -7 

< Cl • 7. 

where the second line follows from a change of variables s = {u — Ui)/ 7 , the third line follows from the 
multivariate form of Taylor’s Theorem with some /3 G [0,1], the fourth line follows because a Kernel function 
has the property / K{u)du = 1, and the fifth line follows by setting ci = max„gi^ | K{s)\7p.{u)'^sds\. Note 
this Cl term is finite because (i) a kernel function has the property that its support is finite (i.e., K{u) = 0 
for Ijull > 1), and (ii) p.{u) is a continuously differentiable probability density function by R4. Next, note 
that by Hoeffding’s inequality ( Vershyn'in||2012 1 we have for t > 0 that 


“(b""* • 7 E;=i K{^) - E [ 7 — iL(ii^) I > t) < 2 exp ( - , 


(60) 


where C 2 = (max„ ^(m))^. Combining (59) and (60) gives the desired result. □ 

Lemma 2. Suppose A1 and R1-R4 hold. Then for t > ■ 7^^^ -|_ q . ^ yjg have 

^(117"" ■ iY."=iyj ■ - ti{ui)S{ui,eo)\\ <2 exp (^-205/172"* • (t-cg -71/2 _C4.7)^ (61) 

where 03 , 04,05 >0 are constants. 

Proof. First, note that S{u,9) consists of a single point from the strict convexity assumption in R3. 
Next, note that having A1 and R1-R4 means that Proposition 4.41 of ( Bonnans and Shapiro|[2000 ) holds: 
This means for 7 > 0 sufficiently small we have 


||5(ii,6»o)-5('Ui,6io)|| <a•7^^^ 


(62) 


where a > 0 is a constant, whenever ||m — rti|| < 7 . Next, recall that pi conditioned on Ui has distribution 
S{ui,6Q) +Wi under IC. Moreover, we have 


E[ 7 -™ 2 /A(^) = £[7—5(71, eo)K{^) In,], 


(63) 
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(64) 


since E(wi) =0 and Wi is independent of Ui. Thus, we have 

= \\y{ui)S{ui,9o) - 7 "™ / r ,. K{^i^)y{u)S{u, 0Q)du\\ 

= \\fj.{ui)S{ui,9o)-j~’^J^rr,Kis)fJ-iui + ^s)S{ui+js,9oh’^ds\\ 

= \\fi{ui)S{ui,9o) - /r„ iir(s)(^(Mi) + 7 V^(ui +, 07 s)^s) {S{ui,9o) + 

S{ui + js,9o) -5(Mi,0o))(is|| 

= II K{s)n{u^){S{ui+js,9o)-S{ui,9Q))ds + itr(s)7V^(ui +/37s)^s5(u,0o)ds|| 

< C3 • 7^^^^ + C4 • 7, 

where the second line follows from a change of variables s = {u — Ui)/"f, the third line follows from the multi¬ 
variate form of Taylor’s Theorem with some /3 G [0,1], the fourth line follows because a Kernel function has the 
property / K(u)du = 1, and the fifth line follows from (62) and by setting C 3 = a • max^g^ | K{s)yL(u)ds\ 


and C 4 = max„gy(IK(s)V/r(u)^S(is| • ||5(u,0o)||). Note the 03,04 terms are finite because (i) a kernel 
function has the property that its support is finite (i.e., K{u) =0 for ||u|| > 1), (ii) ^(u) is a continuously 
differentiable probability density function by R4, and (iii) S{u,9q) is bounded by Rl. Next, note that y is 
a sub-exponential random variable ( Vershynin||2012 1 since (i) S{u,9q) is a bounded random variable by Rl, 
and (ii) w is sub-exponential by R4. Hence, by Hoeffding’s inequality for sub-exponential random variables 
( Vershyn'in ||2012 1 we have for t > 0 that 


"(II 7 -¥.[-! ’"2/K:(J^)|Mi]|| <2exp(-2c5n72™t^. 


(65) 


□ 


for some 05 > 0. Combining ( |64| ) and ( |65[ ) gives the desired result. 

B. Identifiability in Inverse Optimization 

Estimation consistency in any statistical setting (including inverse optimization with noisy data) requires 
that an identifiability condition holds, and such identifiability conditions can be stated under a variety of 
different mathematical formulations ( Wald|[l949 Jennrich]|1969 Bartlett and Mendelson|[2002 Greenshtein 
and Rito^|2004 Bickel and Doksum 2006 Chatterj^|2014 Aswani|[2015 1. The intuition for these different 
formulations is the same: Essentially, an identifiability condition states that the output of the model is 
different for two distinct sets of model parameters. It is important to note that identifiability is a statistical 
property of the model and the error metric used. Consequently, it is possible for an estimator to be statistically 
inconsistent, even when an identifiability condition holds (see for instance Proposition propositiomestincon). 
In the context of inverse optimization with noisy data, we define an identifiability condition IC. 

Showing that IC holds is complicated by the presence of constraints in FOP. To illustrate this, consider 
two related instances of FOP with x G K and 9 G [0,2]. The first min(a: — 9Y is FOP-I, and the second 
min{(a; — 9^ j a: < 1} is FOP-I I. Since these two problems are strictly convex, their minimizers are unique. 
Next, suppose we would like to estimate 9 given a (noiseless) measurement yi of the minimizer. Observe 
that FOP-I is identifiable because we must have 9 = yi. However, FOP-I I is not identifiable because if yi = 1, 
then we may have any 9 G [1,2]. Thus, the constraint x <1 renders FOP-II unidentifiable, and precludes the 
possibility of IC holding for FOP-II. 
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Though FOP-II is not identihable, a related problem is identihable because of external inputs. In particular, 
consider an FOP-MI with a; € M and 9 G [0,2] that is given by min{(x — 6 — u)^ \ a; < 1}. This problem is 
strictly convex, and so its minimizer is unique for each fixed value of u. In fact, the minimizer is given by 
iji = min{(0 -F Ui),l}. And so a sufficient condition for identifiability of FOP-III is if P(Mi < —1) > 0. For 
instance, if Ui = — 1 then yi = 9 — 1 and so 9 is uniquely determined by yi. The presence of the input parameter 
u ensures identifiability of FOP-III. 



